
Conversation

LukeAVanDrie
Contributor

@LukeAVanDrie LukeAVanDrie commented Sep 20, 2025

What type of PR is this?

/kind bug
/kind cleanup

What this PR does / why we need it:

This PR addresses several critical bugs in the endpoint picker that were discovered during refactoring work. These fixes are essential for stability, reliability, and observability.

The key changes are:

  1. Fixes Critical Streaming Metrics Bug: For text/event-stream responses, token usage metrics are being undercounted. The logic only parsed the final [DONE] message for a usage block, failing to accumulate token counts from earlier messages. Furthermore, a context corruption bug caused the model_name label in Prometheus to be empty, rendering the metric unusable for per-model monitoring. This PR corrects the logic to accumulate tokens across the entire stream and ensures all metric labels are correctly populated.

  2. Fixes Header Parsing Reliability: The EPP was brittle in how it parsed headers, exclusively checking the RawValue field and ignoring the Value field. This could lead to two severe production issues:

    • Streaming Detection Failure: If a client sent content-type: text/event-stream using the Value field, the EPP would fail to detect the stream and attempt to buffer the entire response, risking high memory usage and OOM kills.
    • Broken Distributed Tracing: The x-request-id header could be missed, breaking the distributed trace and hindering debugging.
      This PR makes all header parsing robust by correctly checking both fields.
  3. Complete Hermetic Test Overhaul: The integration test suite has been refactored to be significantly more robust and maintainable. The previous shared-state, table-driven test has been replaced with a testHarness structure that provides full resource and state isolation for each test case, eliminating flakes and improving clarity.

Which issue(s) this PR fixes:
Fixes #1624
Fixes #1626

Does this PR introduce a user-facing change?:

- Fixes a critical bug where token usage metrics for streaming responses were undercounted and recorded with a missing `model_name` label.
- Fixes a bug that could cause the gateway to buffer streaming responses incorrectly.
- Improves the robustness of header parsing to prevent dropped `x-request-id` headers, ensuring distributed traces remain intact.

This commit addresses several critical bugs discovered in the External
Processing Proxy (EPP) that impact reliability, observability, and
correctness, particularly for streaming use cases.

**Bug Fixes:**

1.  **Correct Streaming Token Metrics:**
    - **Problem:** For streaming responses (e.g., `text/event-stream`),
      token usage metrics were recorded incorrectly. The logic only
      inspected the final `[DONE]` message for a `usage` block, failing
      to accumulate token counts from earlier messages in the stream.
      Additionally, the `IncomingModelName` was being overwritten by a
      blank value from the request body, causing the `model_name` label
      in Prometheus to be empty.
    - **Fix:** The response handler now correctly accumulates token
      counts from all streaming chunks into the `RequestContext`. The
      final, accurate count is recorded only when the `[DONE]` message
      is received. The request handler logic was reordered to ensure
      headers (containing the model name) are always processed before
      the body, preventing the context from being corrupted.
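
    The capture-then-report flow described above can be sketched as follows. This is a simplified model: `Usage`, `RequestContext`, `onStreamChunk`, and `onStreamEnd` are illustrative names, not the EPP's actual types.

    ```go
    package main

    import "fmt"

    // Usage mirrors the token-usage block of an OpenAI-compatible streaming
    // response. Field names here are illustrative assumptions.
    type Usage struct {
    	PromptTokens     int
    	CompletionTokens int
    	TotalTokens      int
    }

    // RequestContext is a simplified stand-in for the EPP's per-request state.
    type RequestContext struct {
    	Usage Usage
    }

    // onStreamChunk captures the usage block from the current chunk, if any.
    // Chunks without a usage block leave the stored counts untouched, so the
    // chunk that carries the block (typically just before [DONE]) wins.
    func (rc *RequestContext) onStreamChunk(u *Usage) {
    	if u != nil {
    		rc.Usage = *u
    	}
    }

    // onStreamEnd returns the captured counts once [DONE] arrives; this is
    // the single point where the metric should be recorded.
    func (rc *RequestContext) onStreamEnd() Usage {
    	return rc.Usage
    }

    func main() {
    	rc := &RequestContext{}
    	rc.onStreamChunk(nil) // early content chunks carry no usage block
    	rc.onStreamChunk(&Usage{PromptTokens: 7, CompletionTokens: 10, TotalTokens: 17})
    	rc.onStreamChunk(nil) // the final [DONE] message is empty
    	fmt.Println(rc.onStreamEnd().TotalTokens) // prints 17
    }
    ```

    The key property is that an empty trailing chunk can no longer zero out previously captured counts, which is what produced the undercount.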

2.  **Robust Header Parsing:**
    - **Problem:** Multiple locations in the codebase exclusively
      checked the `RawValue` field of an Envoy `HeaderValue` message,
      ignoring the valid `Value` field. This caused failures in
      detecting the `content-type` for streaming and loss of the
      `x-request-id` for tracing if a client sent them in the `Value`
      field.
    - **Fix:** All header parsing logic has been updated to check both
      `RawValue` and `Value`, making it robust and compliant with the
      Envoy API.
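
    The dual-field check might look like the following sketch, using a local stand-in for Envoy's `HeaderValue` message (the real type lives in the Envoy `config.core.v3` protos):

    ```go
    package main

    import "fmt"

    // HeaderValue is a local stand-in for Envoy's config.core.v3 HeaderValue
    // message, which carries the value in either RawValue (bytes) or Value
    // (string).
    type HeaderValue struct {
    	Key      string
    	Value    string
    	RawValue []byte
    }

    // headerValue returns the header's value regardless of which field the
    // client populated, preferring RawValue when both are set.
    func headerValue(h *HeaderValue) string {
    	if len(h.RawValue) > 0 {
    		return string(h.RawValue)
    	}
    	return h.Value
    }

    func main() {
    	byRaw := &HeaderValue{Key: "content-type", RawValue: []byte("text/event-stream")}
    	byVal := &HeaderValue{Key: "content-type", Value: "text/event-stream"}
    	// Both forms must resolve to the same value for streaming detection.
    	fmt.Println(headerValue(byRaw) == headerValue(byVal)) // prints true
    }
    ```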

**Refactoring:**

- **Hermetic Test Overhaul:** The integration test suite in
  `test/integration/epp/hermetic_test.go` has been completely refactored
  for reliability and clarity.
- The old, monolithic, table-driven test has been replaced with a
  `testHarness` structure that provides each test case with its own
   isolated server instance, Kubernetes resources (scoped by a unique
   label), and gRPC client.
- This eliminates test interference and makes the suite significantly
  more stable and maintainable. While true parallelism is still blocked
  by a global metrics registry in controller-runtime, this change
  achieves full resource and state isolation.
- Test cases are now grouped by functionality (`RequestRouting`,
  `ResponseHandling`, etc.) with clear, descriptive names.
- All associated test utilities and documentation have been polished to
  improve readability and maintainability.
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Sep 20, 2025
@k8s-ci-robot
Contributor

@LukeAVanDrie: The label(s) kind/test cannot be applied, because the repository doesn't have them.

In response to this:

What type of PR is this?

/kind bug
/kind test
/kind cleanup


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Sep 20, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: LukeAVanDrie
Once this PR has been reviewed and has the lgtm label, please assign nirrozenbaum for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Sep 20, 2025
@k8s-ci-robot
Contributor

Hi @LukeAVanDrie. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.


@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Sep 20, 2025

netlify bot commented Sep 20, 2025

Deploy Preview for gateway-api-inference-extension ready!

Name Link
🔨 Latest commit 5fd9820
🔍 Latest deploy log https://app.netlify.com/projects/gateway-api-inference-extension/deploys/68ce1ca931d43f0008b9d04f
😎 Deploy Preview https://deploy-preview-1627--gateway-api-inference-extension.netlify.app

requests = append(requests, headerReq)
requests = append(requests, GenerateRequest(logger, prompt, model, filterMetadata))
// Simulate a multi-chunk body by splitting the marshaled JSON.
Contributor Author

This is a small improvement to better reflect actual streaming use cases.

inference_objective_input_tokens_bucket{model_name="",target_model_name="",le="+Inf"} 1
inference_objective_input_tokens_sum{model_name="",target_model_name=""} 7
inference_objective_input_tokens_count{model_name="",target_model_name=""} 1
inference_objective_input_tokens_bucket{model_name="sql-lora-sheddable",target_model_name="sql-lora-1fdg3",le="1"} 0
Contributor Author

Verification that the labels are being applied properly now.

@LukeAVanDrie
Contributor Author

Here is a mapping of the original test cases in HEAD to the new, refactored test cases to verify that the refactoring of the hermetic test suite was not lossy. Hopefully this helps with reviewing as the delta is quite large.

All test scenarios are preserved. The only change in assertions is the intentional correction of the streaming metrics bug.

| Original Test Case (HEAD) | Refactored Test Case (This PR) | Analysis |
| --- | --- | --- |
| **Request Routing & Scheduling Scenarios** | | |
| select lower queue and kv cache, no active lora | `RequestRouting/selects pod with lower queue and kv cache` | Coverage Preserved. Assertions are identical. |
| select active lora, low queue | `RequestRouting/selects pod with active lora and low queue` | Coverage Preserved. Assertions are identical. |
| select lora despite higher kv cache usage | `RequestRouting/selects pod with lora affinity despite higher kv cache` | Coverage Preserved. Assertions are identical. |
| don't shed requests by default | `RequestRouting/routes to least-saturated pod when all pods are under high load` | Coverage Preserved. This tests the same "no shedding" behavior by ensuring the least-bad pod is chosen. Assertions are equivalent. |
| body sent over multiple requests, noncritical... | `RequestRouting/routes request with multi-chunk body` | Coverage Preserved. The core scenario (routing a request with a body split across multiple gRPC messages) is identical. |
| inferenceobjective's modelName is not translated, passthrough | `RequestRouting/passthrough for models not defined in objectives` | Coverage Preserved. Assertions are identical. |
| **Response Handling Scenarios** | | |
| responsebody sent over multiple requests, content-type is json, buffer | `ResponseHandling/buffers and rewrites multi-chunk json response` | Coverage Preserved. Assertions are identical. |
| Response is invalid json; return body | `ResponseHandling/handles invalid json in response body` | Coverage Preserved. Assertions are identical. |
| responsebody sent over a single request, but empty body with EndOfStream... (1st) | `ResponseHandling/handles single chunk response followed by empty EOS chunk` | Coverage Preserved. This correctly tests Envoy's behavior of sending an empty final chunk. Assertions are identical. |
| responsebody sent over a single request, but empty body with EndOfStream... (2nd) | `ResponseHandling/passes through and counts tokens in event-stream response` | Coverage Intentionally Changed (Bug Fix). The old test asserted the buggy behavior (empty `model_name` label). The new test asserts the correct, fully-populated labels. |
| **Subsetting Scenarios** | | |
| select active lora with subsetting tag, all pods available | `Subsetting/selects best pod from available subset` | Coverage Preserved. Assertions are identical. |
| select active lora with subsetting tag, some pods match | `Subsetting/selects only available pod in subset despite high load` | Coverage Preserved. Assertions are identical. |
| select active lora with subsetting tag, no pods available | `Subsetting/returns error when no pods match subset` | Coverage Preserved. Assertions are identical. |
| **Error Condition Scenarios** | | |
| invalid json; return body (request-side) | `ErrorConditions/invalid json in request body` | Coverage Preserved. Assertions are identical. |
| request don't contains invalid payload, model not exist | `ErrorConditions/request body is missing model field` | Coverage Preserved. Assertions are identical. |
| no backend pods are available | `ErrorConditions/no backend pods available in datastore` | Coverage Preserved. Assertions are identical. |
| **Request Type Scenarios** | | |
| simple GET Request | `RequestTypes/simple GET request is passed through` | Coverage Preserved. Assertions are identical. |

@ahg-g
Contributor

ahg-g commented Sep 20, 2025

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Sep 20, 2025
@k8s-ci-robot
Contributor

@LukeAVanDrie: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-gateway-api-inference-extension-test-unit-main 5fd9820 link true /test pull-gateway-api-inference-extension-test-unit-main

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.


func (s *StreamingServer) HandleRequestHeaders(reqCtx *RequestContext, req *extProcPb.ProcessingRequest_RequestHeaders) error {
reqCtx.RequestReceivedTimestamp = time.Now()

// Headers must be processed first to populate the request context as subsequent logic (like body processing) may
Contributor

I am not following, where in this function do we use the reqCtx that would require moving this logic earlier? it is fine to move it up, but wanted to double check where the bug is in this specific case.

// depend upon it.
for _, header := range req.RequestHeaders.Headers.Headers {
key := header.Key
// Per the Envoy API, a header's value can be in either the `RawValue` (bytes) or `Value` (string) field.
Contributor

should we fix ExtractHeaderValue then? see https://github.com/kubernetes-sigs/gateway-api-inference-extension/blob/main/pkg/epp/util/request/headers.go#L35;

Probably best if we have a single function for the if/else logic used across the code for reading the header value. Name it ExtractHeaderValue, and rename the current ExtractHeaderValue to ExtractCaseInsensitiveHeaderValue.
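
The split proposed here could look roughly like the sketch below, with the two function names taken from the suggestion and a local struct standing in for the Envoy proto type (the real helper in `pkg/epp/util/request/headers.go` may differ):

```go
package main

import (
	"fmt"
	"strings"
)

// HeaderValue stands in for Envoy's config.core.v3 HeaderValue proto.
type HeaderValue struct {
	Key      string
	Value    string
	RawValue []byte
}

// ExtractHeaderValue returns a header's value from whichever field is
// populated, preferring RawValue, matching Envoy's semantics.
func ExtractHeaderValue(h *HeaderValue) string {
	if len(h.RawValue) > 0 {
		return string(h.RawValue)
	}
	return h.Value
}

// ExtractCaseInsensitiveHeaderValue finds a header by name, ignoring
// case, and resolves its value through ExtractHeaderValue.
func ExtractCaseInsensitiveHeaderValue(headers []*HeaderValue, name string) (string, bool) {
	for _, h := range headers {
		if strings.EqualFold(h.Key, name) {
			return ExtractHeaderValue(h), true
		}
	}
	return "", false
}

func main() {
	hdrs := []*HeaderValue{{Key: "X-Request-Id", RawValue: []byte("abc-123")}}
	v, ok := ExtractCaseInsensitiveHeaderValue(hdrs, "x-request-id")
	fmt.Println(v, ok) // prints: abc-123 true
}
```

With this shape, every call site reads header values through one code path, so the RawValue/Value handling cannot drift between locations.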

Contributor Author

Yes, this is a much cleaner solution, thanks! I missed this helper as I was tracing the request path. Most of these were discovered through hermetic test failures post-refactoring, and I am not deeply familiar with this part of the codebase. I will consolidate this into the helper.

reqCtx.Request.Headers[requtil.RequestIdHeaderKey] = requestID // update in headers so director can consume it
}
// Ensure the request ID, whether pre-existing or newly generated, is in the context's header map.
// This makes it available to all downstream logic (e.g., the director).
Contributor

Trying to understand if this is really a bug: in the case where the header already existed, HandleRequestHeaders would have set it in reqCtx.Request.Headers, right? I guess the only difference here is that we are unifying the header key as requtil.RequestIdHeaderKey, since at line 187 the header could have had a different casing.

Contributor Author

since at line 187 the header could have had a different casing

Let me look into this more. You may be onto something here. Perhaps there is a simpler solution / root cause for the behaviors I was seeing.

// Parse the current chunk for a 'usage' block.
resp := parseRespForUsage(ctx, responseText)

// Accumulate token counts.
Contributor

Two comments:

  1. From a metrics perspective, we are not truly accumulating anything; this is just reporting the values in the last block.

  2. Which logic depends on reqCtx.Usage while we are accumulating the stream, such that we would need to keep updating the usage rather than just setting it at the end?

Contributor Author

@LukeAVanDrie Sep 21, 2025

The issue here was that the 'usage' block was not coming in the final "[DONE]" message in the hermetic testing environment. It was arriving in the penultimate message instead.

When processing the final message of the stream, it was empty, resulting in underreporting and empty labels. I need to do some more investigating to understand if this is an issue with the hermetic testing environment or something that can also occur with real traffic.

From a metrics perspective, we are not truly accumulating anything; this is just reporting the values in the last block.

True, we just need to find the block that populates it. My comment is misleading as this is still a "detect and write once" process that is only reported after the stream ends.

Contributor Author

@LukeAVanDrie Sep 21, 2025

I was trying to understand and debug behavior from integration test results, and I am not too familiar with our request stream handling code and the properties we can rely upon in production. It's very possible my root cause analysis is wrong in places, but the symptoms I reported in the linked issues are reproducible.

Let me write some targeted unit tests over the relevant components to better understand this.

Contributor

The issue here was that the 'usage' block was not coming in the final "[DONE]" message in the hermetic testing environment. It was arriving in the penultimate message instead.

This might be an issue in the hermetic test. In real traffic, the penultimate and last messages come in the same chunk:

data: {"id":"...","object":"text_completion","created":1739400043,"model":"food-review-0","choices":[], "usage":{"prompt_tokens":7,"total_tokens":17,"completion_tokens":10}}

data: [DONE]
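
Given that on-the-wire shape, a parser that scans every `data:` line in a chunk will pick up the usage block even when it shares a chunk with the `[DONE]` terminator. A sketch under that assumption (not the EPP's actual parser; types here are illustrative):

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

type usage struct {
	PromptTokens     int `json:"prompt_tokens"`
	CompletionTokens int `json:"completion_tokens"`
	TotalTokens      int `json:"total_tokens"`
}

type sseEvent struct {
	Usage *usage `json:"usage"`
}

// parseChunk scans every "data:" line in one SSE chunk. It returns the
// last usage block seen and whether the [DONE] terminator was present,
// handling the case where both arrive in the same chunk.
func parseChunk(chunk string) (*usage, bool) {
	var u *usage
	done := false
	for _, raw := range strings.Split(chunk, "\n") {
		line := strings.TrimSpace(raw)
		if !strings.HasPrefix(line, "data:") {
			continue
		}
		payload := strings.TrimSpace(strings.TrimPrefix(line, "data:"))
		if payload == "[DONE]" {
			done = true
			continue
		}
		var ev sseEvent
		if err := json.Unmarshal([]byte(payload), &ev); err == nil && ev.Usage != nil {
			u = ev.Usage
		}
	}
	return u, done
}

func main() {
	chunk := "data: {\"choices\":[], \"usage\":{\"prompt_tokens\":7,\"total_tokens\":17,\"completion_tokens\":10}}\n\ndata: [DONE]\n"
	u, done := parseChunk(chunk)
	fmt.Println(u.TotalTokens, done) // prints: 17 true
}
```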

Contributor Author

Great, thanks for confirming! I will fix the hermetic test setup to conform to this.

I think #1626 can be closed as WAI. I also reported an issue where metric labels were not being set properly, so I need to first confirm that this is solely triggered by the data: [DONE] final chunk case. If so, I'll revert my change to this response handling code.

Comment on lines +40 to +44
if header.RawValue != nil {
reqCtx.Request.Headers[key] = string(header.RawValue)
} else {
reqCtx.Request.Headers[key] = header.Value
}
Contributor

@nirrozenbaum Sep 21, 2025

continuing @ahg-g line of thinking - I agree we should fix the ExtractHeaderValue helper func and then we can use it here, e.g.,:

reqCtx.Request.Headers[key] = ExtractHeaderValue(key)

I think the logic that handles RawValue or Value should be scoped to a single helper function.

Comment on lines +246 to +251
var value string
if len(header.RawValue) > 0 {
value = string(header.RawValue)
} else {
value = header.Value
}
Contributor

use the single helper function?
RawValue/Value should be encapsulated to ExtractHeaderValue helper func.

Comment on lines +35 to +39
// Per the Envoy API, a header's value can be in either the `RawValue` (bytes) or `Value` (string) field.
if len(headerKv.RawValue) > 0 {
return string(headerKv.RawValue)
}
return headerKv.Value
Contributor

ditto

// GetFreePort finds and returns an available TCP port on the host.
// It works by asking the OS to allocate a port by listening on port 0, capturing the assigned address, and then
// immediately closing the listener.
func GetFreePort() (*net.TCPAddr, error) {
Contributor

this is great
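
For reference, a common implementation of the pattern the doc comment describes looks like the sketch below (the repository's version may differ in detail):

```go
package main

import (
	"fmt"
	"net"
)

// GetFreePort finds and returns an available TCP port on the host.
// It asks the OS to allocate a port by listening on port 0, captures
// the assigned address, and closes the listener before returning.
func GetFreePort() (*net.TCPAddr, error) {
	l, err := net.Listen("tcp", "localhost:0")
	if err != nil {
		return nil, err
	}
	defer l.Close()
	addr, ok := l.Addr().(*net.TCPAddr)
	if !ok {
		return nil, fmt.Errorf("unexpected address type %T", l.Addr())
	}
	return addr, nil
}

func main() {
	addr, err := GetFreePort()
	if err != nil {
		panic(err)
	}
	fmt.Println(addr.Port > 0) // prints true
}
```

Note the inherent race: the port can in principle be reclaimed between closing the listener and the test server binding it, which is generally acceptable for test setup.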

@LukeAVanDrie
Contributor Author

--- FAIL: TestFullDuplexStreamed_KubeInferenceObjectiveRequest (8.13s)
    --- FAIL: TestFullDuplexStreamed_KubeInferenceObjectiveRequest/ResponseHandling (0.67s)
        --- FAIL: TestFullDuplexStreamed_KubeInferenceObjectiveRequest/ResponseHandling/buffers_and_rewrites_multi-chunk_json_response (0.19s)
            util.go:86: Sending request: request_headers:{headers:{headers:{key:"x-gateway-inference-objective"  value:"sql-lora-sheddable"}  headers:{key:"x-gateway-model-name-rewrite"  value:"sql-lora-1fdg3"}  headers:{key:"x-request-id"  value:"test-static-id-1"}}  end_of_stream:true}
            util.go:86: Sending request: response_headers:{headers:{headers:{key:"content-type"  value:"application/json"}}}
            util.go:86: Sending request: response_body:{body:"{\"model\":\"sql-lora-sheddable\", \"prompt\": \"test\"}"}
            util.go:86: Sending request: response_body:{body:"}"  end_of_stream:true}
            util.go:118: Received response request_headers:{response:{header_mutation:{set_headers:{header:{key:"x-gateway-destination-endpoint"  raw_value:"192.168.1.1:8000"}}  set_headers:{header:{key:"x-request-id"  raw_value:"test-static-id-1"}}}  clear_route_cache:true}}  dynamic_metadata:{fields:{key:"envoy.lb"  value:{struct_value:{fields:{key:"x-gateway-destination-endpoint"  value:{string_value:"192.168.1.1:8000"}}}}}}
            util.go:118: Received response response_headers:{response:{header_mutation:{set_headers:{header:{key:"x-went-into-resp-headers"  raw_value:"true"}}  set_headers:{header:{key:"content-type"  raw_value:"application/json"}}}}}
            util.go:118: Received response response_body:{response:{body_mutation:{streamed_response:{body:"{\"model\":\"sql-lora-sheddable\", \"prompt\": \"test\"}}"  end_of_stream:true}}}}
            hermetic_test.go:947: Unexpected response, (-want +got):   []*ext_procv3.ProcessingResponse{
                  	Inverse(protocmp.Transform, protocmp.Message{"@type": s"envoy.service.ext_proc.v3.ProcessingResponse", "dynamic_metadata": protocmp.Message{"@type": s"google.protobuf.Struct", "fields": map[string]protocmp.Message{"envoy.lb": {"@type": s"google.protobuf.Value", "struct_value": protocmp.Message{"@type": s"google.protobuf.Struct", "fields": map[string]protocmp.Message{"x-gateway-destination-endpoint": {"@type": s"google.protobuf.Value", "string_value": string("192.168.1.1:8000")}}}}}}, "request_headers": protocmp.Message{"@type": s"envoy.service.ext_proc.v3.HeadersResponse", "response": protocmp.Message{"@type": s"envoy.service.ext_proc.v3.CommonResponse", "clear_route_cache": bool(true), "header_mutation": protocmp.Message{"@type": s"envoy.service.ext_proc.v3.HeaderMutation", "set_headers": []protocmp.Message{{"@type": s"envoy.config.core.v3.HeaderValueOption", "header": protocmp.Message{"@type": s"envoy.config.core.v3.HeaderValue", "key": string("x-gateway-destination-endpoint"), "raw_value": []uint8("192.168.1.1:8000")}}, {"@type": s"envoy.config.core.v3.HeaderValueOption", "header": protocmp.Message{"@type": s"envoy.config.core.v3.HeaderValue", "key": string("x-request-id"), "raw_value": []uint8("test-static-id-1")}}}}}}}),
                  	Inverse(protocmp.Transform, protocmp.Message{"@type": s"envoy.service.ext_proc.v3.ProcessingResponse", "response_headers": protocmp.Message{"@type": s"envoy.service.ext_proc.v3.HeadersResponse", "response": protocmp.Message{"@type": s"envoy.service.ext_proc.v3.CommonResponse", "header_mutation": protocmp.Message{"@type": s"envoy.service.ext_proc.v3.HeaderMutation", "set_headers": []protocmp.Message(Inverse(cmpopts.SortSlices, []protocmp.Message{{"@type": s"envoy.config.core.v3.HeaderValueOption", "header": protocmp.Message{"@type": s"envoy.config.core.v3.HeaderValue", "key": string("content-type"), "raw_value": []uint8("application/json")}}, {"@type": s"envoy.config.core.v3.HeaderValueOption", "header": protocmp.Message{"@type": s"envoy.config.core.v3.HeaderValue", "key": string("x-went-into-resp-headers"), "raw_value": []uint8("true")}}}))}}}}),
                  	Inverse(protocmp.Transform, protocmp.Message{
                  		"@type": s"envoy.service.ext_proc.v3.ProcessingResponse",
                  		"response_body": protocmp.Message{
                  			"@type": s"envoy.service.ext_proc.v3.BodyResponse",
                  			"response": protocmp.Message{
                  				"@type": s"envoy.service.ext_proc.v3.CommonResponse",
                  				"body_mutation": protocmp.Message{
                  					"@type": s"envoy.service.ext_proc.v3.BodyMutation",
                  					"streamed_response": protocmp.Message{
                  						"@type": s"envoy.service.ext_proc.v3.StreamedBodyResponse",
                  						"body": bytes.Join({
                  							`{"model":"sql-lora-sheddable",`,
                + 							" ",
                  							`"prompt":`,
                + 							" ",
                  							`"test"`,
                + 							"}",
                  							"}",
                  						}, ""),
                  						"end_of_stream": bool(true),
                  					},
                  				},
                  			},
                  		},
                  	}),
                  }
FAIL
coverage: [no statements]
FAIL	sigs.k8s.io/gateway-api-inference-extension/test/integration/epp	15.606s
	sigs.k8s.io/gateway-api-inference-extension/test/utils		coverage: 0.0% of statements
?   	sigs.k8s.io/gateway-api-inference-extension/version	[no test files]
FAIL
make: *** [Makefile:137: test] Error 1

This passes when I run make test or make test-integration locally. Interestingly, it fails in the CI/CD environment for what appear to be minor formatting differences in the test expectations. Let me see how I can make this less brittle.

Contributor Author

@LukeAVanDrie Sep 21, 2025

The primary focus of this PR is a significant refactoring of the EPP hermetic test suite to achieve true test isolation. While it also includes important bug fixes that this new test rigor uncovered, the change here was necessary to unblock upcoming integration tests for the Flow Control layer, which require a clean and mutable environment for each test scenario.

Guidance for Reviewers

I would greatly appreciate your focus on the test refactoring itself:

  • Isolation Strategy: Does the combination of unique labels and a scoped cache seem robust and effective for serial execution?
  • Clarity & Maintainability: Is the new testHarness approach clearer and easier to extend?
  • Potential Risks: Do you see any potential downsides to this per-test setup approach?

I just want to make sure this doesn't get overlooked amidst the bug fix discussion as this is the heart of this PR.

Drawbacks of the Previous Approach

Previously, our tests relied on a shared, suite-level setup. The controller-runtime manager's cache was unscoped, meaning it watched all pods in the namespace. This design was fragile and prone to state leakage between test runs, making it difficult to write reliable tests for features that need to mutate backend state (e.g., marking pods as saturated/unsaturated).

The New testHarness

This PR overhauls this approach by introducing a testHarness that provides a fully encapsulated, hermetic environment for each test case.

Here’s a summary of the key isolation mechanisms:

  1. Per-Test Harness: Each test now runs within its own testHarness instance, which encapsulates all necessary components (server, client, manager) and state.
  2. Unique Resource Labeling: A unique test-run-id label (derived from a UUID) is generated for each harness and applied to all Kubernetes pods it creates.
  3. Scoped Kubernetes Cache: This is the most crucial part. The controller-runtime manager within each harness is configured with a cache that only watches pods matching that specific, unique test-run-id label. This prevents the test from seeing or interacting with pods from any other test run.
  4. Dedicated Lifecycle Management: Each harness starts its own server instance and creates its own gRPC client. A robust cleanup function, automatically called by t.Cleanup, uses a context.CancelFunc to gracefully tear down the server, manager, and client, and then deletes the uniquely labeled Kubernetes pods.

A Note on Parallelism

Despite this isolation, tests cannot use t.Parallel() because controller-runtime's metrics registry is process-global. To keep serial execution clean, each test resets the registry during teardown (t.Cleanup(metrics.Reset)). True parallelism would require a per-test envtest instance, which is prohibitively slow, so this refactoring is a pragmatic middle ground: it eliminates resource contention while still sharing the suite-level envtest setup.

Thanks!

@kfswain
Collaborator

kfswain commented Sep 22, 2025

> The logic only parsed the final [DONE] message for a usage block, failing to accumulate token counts from earlier messages.

IIRC the usage field is always on the tail end of a streamed message, I remember going through these cases with Jeff. Why would the previous messages have a usage block?

@LukeAVanDrie
Contributor Author

LukeAVanDrie commented Sep 22, 2025

> > The logic only parsed the final [DONE] message for a usage block, failing to accumulate token counts from earlier messages.
>
> IIRC the usage field is always on the tail end of a streamed message, I remember going through these cases with Jeff. Why would the previous messages have a usage block?

This may actually be an issue in the test setup then where we are not accurately simulating the streaming use case. If this is never the case in production, I will update the test utils to reflect this and revert this particular change.

I changed the structure of how the tests are run (so they are executed in isolated environments) but not the assertions, so I was chasing down failures and may have root caused this incorrectly.

@ahg-g
Contributor

ahg-g commented Sep 22, 2025

> > > The logic only parsed the final [DONE] message for a usage block, failing to accumulate token counts from earlier messages.
> >
> > IIRC the usage field is always on the tail end of a streamed message, I remember going through these cases with Jeff. Why would the previous messages have a usage block?
>
> This may actually be an issue in the test setup then where we are not accurately simulating the streaming use case. If this is never the case in production, I will update the test utils to reflect this and revert this particular change.
>
> I changed the structure of how the tests are run (so they are executed in isolated environments) but not the assertions, so I was chasing down failures and may have root caused this incorrectly.

I don't mind keeping the logic you have if we think it is more "resilient"

@kfswain
Collaborator

kfswain commented Sep 22, 2025

> > > > The logic only parsed the final [DONE] message for a usage block, failing to accumulate token counts from earlier messages.
> > >
> > > IIRC the usage field is always on the tail end of a streamed message, I remember going through these cases with Jeff. Why would the previous messages have a usage block?
> >
> > This may actually be an issue in the test setup then where we are not accurately simulating the streaming use case. If this is never the case in production, I will update the test utils to reflect this and revert this particular change.
> >
> > I changed the structure of how the tests are run (so they are executed in isolated environments) but not the assertions, so I was chasing down failures and may have root caused this incorrectly.
>
> I don't mind keeping the logic you have if we think it is more "resilient"

Yes, agreed, that's fine.

I'm just walking through a risk assessment here, weighing the risk of patching in new logic against the risk of leaving these bugs in place.

Labels
cncf-cla: yes - Indicates the PR's author has signed the CNCF CLA.
kind/bug - Categorizes issue or PR as related to a bug.
kind/cleanup - Categorizes issue or PR as related to cleaning up code, process, or technical debt.
ok-to-test - Indicates a non-member PR verified by an org member that is safe to test.
size/XXL - Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
6 participants